Information Extraction
HunyuanOCR Technical Report
Hunyuan Vision Team: Pengyuan Lyu, Xingyu Wan, Gengluo Li, Shangpin Peng, Weinong Wang, Liang Wu, Huawen Shen, Yu Zhou, Canhui Tang, Qi Yang, Qiming Peng, Bin Luo, Hower Yang, Xinsong Zhang, Jinnian Zhang, Houwen Peng, Hongming Yang, Senhao Xie, Longsha Zhou, Ge Pei, Binghong Wu, Rui Yan, Kan Wu, Jieneng Yang, Bochao Wang, Kai Liu, Jianchen Zhu, Jie Jiang, Linus, Han Hu, Chengquan Zhang
This paper presents HunyuanOCR, a commercial-grade, open-source, and lightweight (1B parameters) Vision-Language Model (VLM) dedicated to OCR tasks. The architecture comprises a Native Vision Transformer (ViT) and a lightweight LLM connected via an MLP adapter. HunyuanOCR demonstrates superior performance, outperforming commercial APIs, traditional pipelines, and larger models (e.g., Qwen3-VL-4B). Specifically, it surpasses current public solutions in perception tasks (Text Spotting, Parsing) and excels in semantic tasks (IE, Text Image Translation), securing first place in the ICDAR 2025 DIMT Challenge (Small Model Track). Furthermore, it achieves state-of-the-art (SOTA) results on OCRBench among VLMs with fewer than 3B parameters. HunyuanOCR achieves breakthroughs in three key aspects: 1) Unifying Versatility and Efficiency: We implement comprehensive support for core capabilities including spotting, parsing, IE, VQA, and translation within a lightweight framework. This addresses the limitations of narrow "OCR expert models" and inefficient "General VLMs". 2) Streamlined End-to-End Architecture: Adopting a pure end-to-end paradigm eliminates dependencies on pre-processing modules (e.g., layout analysis). This fundamentally resolves error propagation common in traditional pipelines and simplifies system deployment. 3) Data-Driven and RL Strategies: We confirm the critical role of high-quality data and, for the first time in the industry, demonstrate that Reinforcement Learning (RL) strategies yield significant performance gains in OCR tasks. HunyuanOCR is officially open-sourced on HuggingFace. We also provide a high-performance deployment solution based on vLLM, placing its production efficiency in the top tier. We hope this model will advance frontier research and provide a solid foundation for industrial applications.
- Workflow (1.00)
- Research Report (1.00)
- Education (1.00)
- Media (0.67)
- Health & Medicine (0.67)
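The ViT, MLP adapter, and LLM wiring the HunyuanOCR abstract describes can be sketched in a few lines. All dimensions, layer shapes, and function names below are illustrative placeholders, not HunyuanOCR's actual configuration:

```python
import numpy as np

# Minimal sketch of a ViT -> MLP adapter -> LLM connection.
# Sizes are hypothetical, chosen only to make the shapes visible.
rng = np.random.default_rng(0)
VIT_DIM, LLM_DIM, N_PATCHES = 32, 64, 16

def vit_encode(image_patches):
    """Stand-in for the vision encoder: one linear projection per patch."""
    W = rng.normal(size=(image_patches.shape[-1], VIT_DIM))
    return image_patches @ W                    # (n_patches, VIT_DIM)

def mlp_adapter(vision_tokens):
    """Two-layer MLP mapping vision features into the LLM embedding space."""
    W1 = rng.normal(size=(VIT_DIM, LLM_DIM))
    W2 = rng.normal(size=(LLM_DIM, LLM_DIM))
    h = np.maximum(vision_tokens @ W1, 0.0)     # ReLU
    return h @ W2                               # (n_patches, LLM_DIM)

patches = rng.normal(size=(N_PATCHES, 48))      # flattened image patches
vision_tokens = vit_encode(patches)
llm_tokens = mlp_adapter(vision_tokens)         # ready to prepend to text tokens
print(llm_tokens.shape)                         # (16, 64)
```

The adapter's only job is dimension matching: the resulting `llm_tokens` are concatenated with text embeddings and fed to the language model as ordinary tokens.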
Neurosymbolic Information Extraction from Transactional Documents
Arthur Hemmer, Mickaël Coustaty, Nicola Bartolo, Jean-Marc Ogier
This paper presents a neurosymbolic framework for information extraction from documents, evaluated on transactional documents. We introduce a schema-based approach that integrates symbolic validation methods to enable more effective zero-shot output and knowledge distillation. The methodology uses language models to generate candidate extractions, which are then filtered through syntactic-, task-, and domain-level validation to ensure adherence to domain-specific arithmetic constraints. Our contributions include a comprehensive schema for transactional documents, relabeled datasets, and an approach for generating high-quality labels for knowledge distillation. Experimental results demonstrate significant improvements in F1-scores and accuracy, highlighting the effectiveness of neurosymbolic validation in transactional document processing.
- Europe > France (0.04)
- North America > United States > California > Los Angeles County > Santa Monica (0.04)
- Europe > Greece > Attica > Athens (0.04)
- Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
- Information Technology > Data Science > Data Mining > Text Mining (0.72)
- Information Technology > Artificial Intelligence > Natural Language > Information Extraction (0.72)
- Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.68)
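The syntactic-, task-, and domain-level validation cascade the abstract describes can be sketched as a filter over candidate extractions. This is a minimal sketch under a simplified receipt schema; the field names and the single arithmetic constraint (subtotal + tax = total) are assumptions, and the paper's actual schema is richer:

```python
from decimal import Decimal

def syntactic_ok(c):
    """Syntactic level: every amount field parses as a decimal number."""
    try:
        for k in ("subtotal", "tax", "total"):
            Decimal(c[k])
        return True
    except (KeyError, ArithmeticError):  # decimal.InvalidOperation is an ArithmeticError
        return False

def task_ok(c):
    """Task level: all required fields are present and non-empty."""
    return all(c.get(k) for k in ("subtotal", "tax", "total"))

def domain_ok(c):
    """Domain level: transactional arithmetic must hold exactly."""
    return Decimal(c["subtotal"]) + Decimal(c["tax"]) == Decimal(c["total"])

def validate(candidates):
    """Keep only candidates passing all three levels, cheapest checks first."""
    return [c for c in candidates if syntactic_ok(c) and task_ok(c) and domain_ok(c)]

candidates = [
    {"subtotal": "100.00", "tax": "19.00", "total": "119.00"},  # consistent
    {"subtotal": "100.00", "tax": "19.00", "total": "129.00"},  # fails arithmetic
    {"subtotal": "abc",    "tax": "19.00", "total": "119.00"},  # fails parsing
]
print(len(validate(candidates)))  # 1
```

The cascade ordering matters in practice: syntactic checks are cheap and reject most malformed model outputs before the domain-level arithmetic runs.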
ThaiOCRBench: A Task-Diverse Benchmark for Vision-Language Understanding in Thai
Surapon Nonesung, Teetouch Jaknamon, Sirinya Chaiophat, Natapong Nitarach, Chanakan Wittayasakpan, Warit Sirichotedumrong, Adisai Na-Thalang, Kunat Pipatanakul
We present ThaiOCRBench, the first comprehensive benchmark for evaluating vision-language models (VLMs) on Thai text-rich visual understanding tasks. Despite recent progress in multimodal modeling, existing benchmarks predominantly focus on high-resource languages, leaving Thai underrepresented, especially in tasks requiring document structure understanding. ThaiOCRBench addresses this gap by offering a diverse, human-annotated dataset comprising 2,808 samples across 13 task categories. We evaluate a wide range of state-of-the-art VLMs in a zero-shot setting, spanning both proprietary and open-source systems. Results show a significant performance gap, with proprietary models (e.g., Gemini 2.5 Pro) outperforming open-source counterparts. Notably, fine-grained text recognition and handwritten content extraction exhibit the steepest performance drops among open-source models. Through detailed error analysis, we identify key challenges such as language bias, structural mismatch, and hallucinated content. ThaiOCRBench provides a standardized framework for assessing VLMs in low-resource, script-complex settings, and offers actionable insights for improving Thai-language document understanding.
- North America > United States > Pennsylvania > Philadelphia County > Philadelphia (0.04)
- North America > United States > Michigan > Washtenaw County > Ann Arbor (0.04)
- North America > United States > Florida > Miami-Dade County > Miami (0.04)
- (3 more...)
Information Extraction From Fiscal Documents Using LLMs
Vikram Aggarwal, Jay Kulkarni, Aditi Mascarenhas, Aakriti Narang, Siddarth Raman, Ajay Shah, Susan Thomas
Large Language Models (LLMs) have demonstrated remarkable capabilities in text comprehension, but their ability to process complex, hierarchical tabular data remains underexplored. We present a novel approach to extracting structured data from multi-page government fiscal documents using LLM-based techniques. Applied to annual fiscal documents from the State of Karnataka in India (200+ pages), our method achieves high accuracy through a multi-stage pipeline that leverages domain knowledge, sequential context, and algorithmic validation. A major challenge with traditional OCR methods is the inability to verify the accurate extraction of numbers. In fiscal data, the inherent structure of fiscal tables, with totals at each level of the hierarchy, allows for robust internal validation of the extracted data. We use these hierarchical relationships to create multi-level validation checks. We demonstrate that LLMs can read tables and also process document-specific structural hierarchies, offering a scalable process for converting PDF-based fiscal disclosures into research-ready databases. Our implementation shows promise for broader applications across developing country contexts.
- Asia > India > Karnataka (0.27)
- Asia > Singapore (0.07)
- Asia > India > Maharashtra > Mumbai (0.05)
- (2 more...)
- Research Report (1.00)
- Overview > Innovation (0.34)
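The multi-level validation idea from the fiscal-documents abstract can be sketched as a recursive sum check: each parent line's stated amount should equal the sum of its children, so OCR or extraction errors surface as mismatches. The budget tree below is hypothetical, not taken from the Karnataka data:

```python
def check_totals(node, tolerance=0.01):
    """Recursively verify that every parent's amount equals its children's sum."""
    errors = []
    children = node.get("children", [])
    if children:
        child_sum = sum(c["amount"] for c in children)
        if abs(child_sum - node["amount"]) > tolerance:
            errors.append((node["name"], node["amount"], child_sum))
        for c in children:
            errors.extend(check_totals(c, tolerance))
    return errors

# Illustrative budget tree with one deliberately misread amount.
budget = {
    "name": "Department Total", "amount": 500.0, "children": [
        {"name": "Salaries", "amount": 300.0},
        {"name": "Capital", "amount": 210.0,  # misread: children sum to 200.0
         "children": [{"name": "Buildings", "amount": 150.0},
                      {"name": "Equipment", "amount": 50.0}]},
    ],
}
for name, stated, computed in check_totals(budget):
    print(f"{name}: stated {stated}, children sum to {computed}")
```

Note that a single misread leaf can flag mismatches at every ancestor level, which is what makes these checks useful for localizing extraction errors.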
Supervised Fine-Tuning of Large Language Models for Domain-Specific Knowledge Graph Construction: A Case Study on Hunan's Historical Celebrities
Junjie Hao, Chun Wang, Ying Qiao, Qiuyue Zuo, Qiya Song, Hua Ma, Xieping Gao
Large language models and knowledge graphs hold broad application potential in the field of historical culture, facilitating the excavation, research, and comprehension of cultural heritage. Taking the historical celebrities of Hunan who emerged from modern Huxiang culture as a case study, pre-trained large models can assist researchers in rapidly extracting specific historical figure information from literature--including basic details, life events, and social relationships--and constructing structured knowledge graphs, thereby supporting related research. Currently, systematic data collection on Hunan's historical celebrities remains scarce. Moreover, general-purpose large language models often exhibit insufficient domain-knowledge extraction accuracy and weak structured-output capabilities in such low-resource scenarios. Therefore, this paper proposes a supervised fine-tuning approach for domain-specific large models to enhance the quality and efficiency of information extraction regarding Hunan's historical celebrities. Specifically, this paper first designs a fine-grained, schema-guided instruction fine-tuning template for the Hunan historical celebrities domain. Using this template, we construct an instruction fine-tuning dataset, addressing the current lack of instruction datasets in domain-specific model fine-tuning. Second, we conducted parameter-efficient instruction fine-tuning on four publicly available large language models--Qwen2.5-7B, Qwen3-8B, DeepSeek-R1-Distill-Qwen-7B, and Llama-3.1-8B-Instruct--using the proposed instruction dataset, and established evaluation criteria for assessing their performance in character information extraction. Experimental results demonstrate that the performance of all four base models significantly improved after domain-specific fine-tuning. Among them, Qwen3-8B achieved the best performance after training with 100 samples and 50 fine-tuning iterations, scoring 89.3866 on the evaluation metrics.
This research offers new insights for fine-tuning vertical large models tailored to regional historical and cultural domains, holding significant implications for promoting the cost-effective application of large models and knowledge graphs in the field of historical and cultural heritage.
- Asia > China > Hunan Province (0.14)
- South America > Colombia > Meta Department > Villavicencio (0.04)
- Europe > Romania > Sud - Muntenia Development Region > Giurgiu County > Giurgiu (0.04)
- (2 more...)
- Health & Medicine (0.46)
- Government (0.46)
- Information Technology (0.46)
A Reasoning Paradigm for Named Entity Recognition
Hui Huang, Yanping Chen, Ruizhang Huang, Chuan Lin, Yongbin Qin
Generative LLMs typically improve Named Entity Recognition (NER) performance through instruction tuning. They excel at generating entities by semantic pattern matching but lack an explicit, verifiable reasoning mechanism. This "cognitive shortcutting" leads to suboptimal performance and brittle generalization, especially in zero-shot and low-resource scenarios where reasoning from limited contextual cues is crucial. To address this issue, a reasoning framework, ReasoningNER, is proposed for NER, which shifts the extraction paradigm from implicit pattern matching to explicit reasoning. This framework consists of three stages: Chain of Thought (CoT) generation, CoT tuning, and reasoning enhancement. First, a dataset annotated with NER-oriented CoTs is generated, containing task-relevant reasoning chains. Then, these are used to tune the NER model to generate coherent rationales before deriving the final answer. Finally, a reasoning enhancement stage is implemented to optimize the reasoning process using a comprehensive reward signal, ensuring explicit and verifiable extractions. Experiments show that ReasoningNER demonstrates impressive cognitive ability in the NER task, achieving competitive performance. In zero-shot settings, it achieves state-of-the-art (SOTA) performance, outperforming GPT-4 by 12.3 percentage points on F1 score. Analytical results also demonstrate its great potential to advance research in reasoning-oriented information extraction. Our code is available at https://github.com/HuiResearch/ReasoningIE.
- Information Technology > Artificial Intelligence > Natural Language > Text Processing (1.00)
- Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
- Information Technology > Artificial Intelligence > Cognitive Science > Problem Solving (1.00)
- Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.68)
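The "rationale before answer" format and a verifiable extraction check in the spirit of the reward signal can be sketched as follows. The tag names, answer format, and example sentence are assumptions for illustration, not the paper's actual protocol:

```python
import re

sentence = "Barack Obama visited Paris in 2015."

# Hypothetical model output: a reasoning chain, then a structured answer.
model_output = """<think>
'Barack Obama' is a person name; 'Paris' follows 'visited', so a location.
</think>
<answer>PER: Barack Obama | LOC: Paris</answer>"""

def parse_answer(output):
    """Pull (type, span) pairs out of the <answer> block."""
    m = re.search(r"<answer>(.*?)</answer>", output, re.S)
    pairs = []
    for item in m.group(1).split("|"):
        etype, span = item.split(":", 1)
        pairs.append((etype.strip(), span.strip()))
    return pairs

def verifiable(pairs, text):
    """Reward-style check: every predicted span must occur verbatim in the text."""
    return all(span in text for _, span in pairs)

entities = parse_answer(model_output)
print(entities)                         # [('PER', 'Barack Obama'), ('LOC', 'Paris')]
print(verifiable(entities, sentence))   # True
```

A span-occurrence check like this is the simplest verifiable signal for extraction tasks: unlike fluency-based rewards, it can be computed exactly and penalizes hallucinated entities directly.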
Scaling Open-Weight Large Language Models for Hydropower Regulatory Information Extraction: A Systematic Analysis
Hong-Jun Yoon, Faisal Ashraf, Thomas A. Ruggles, Debjani Singh
Information extraction from regulatory documents using large language models presents critical trade-offs between performance and computational resources. We evaluated seven open-weight models (0.6B-70B parameters) on hydropower licensing documentation to provide empirical deployment guidance. Our analysis identified a pronounced 14B-parameter threshold at which validation methods transition from ineffective (F1 < 0.15) to viable (F1 = 0.64). Consumer-deployable models achieve 64% F1 through appropriate validation, while smaller models plateau at 51%. Large-scale models approach 77% F1 but require enterprise infrastructure. We identified systematic hallucination patterns in which perfect recall indicates extraction failure rather than success in smaller models. Our findings establish the first comprehensive resource-performance mapping for open-weight information extraction in regulatory contexts, enabling evidence-based model selection. These results provide immediate value for hydropower compliance while contributing insights into parameter scaling effects that generalize across information extraction tasks.
- North America > United States > Tennessee > Anderson County > Oak Ridge (0.04)
- North America > United States > Montana > Roosevelt County (0.04)
- North America > Canada > Alberta (0.04)
- (2 more...)
- Government > Regional Government > North America Government > United States Government (1.00)
- Energy > Renewable (1.00)
- Energy > Power Industry (0.90)
- Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
- Information Technology > Artificial Intelligence > Natural Language > Information Extraction (1.00)
- Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)
Knots: A Large-Scale Multi-Agent Enhanced Expert-Annotated Dataset and LLM Prompt Optimization for NOTAM Semantic Parsing
Maoqi Liu, Quan Fang, Yang Yang, Can Zhao, Kaiquan Cai
Notice to Air Missions (NOTAMs) serve as a critical channel for disseminating key flight safety information, yet their complex linguistic structures and implicit reasoning pose significant challenges for automated parsing. Existing research mainly focuses on surface-level tasks such as classification and named entity recognition, lacking deep semantic understanding. To address this gap, we propose NOTAM semantic parsing, a task emphasizing semantic inference and the integration of aviation domain knowledge to produce structured, inference-rich outputs. To support this task, we construct Knots (Knowledge and NOTAM Semantics), a high-quality dataset of 12,347 expert-annotated NOTAMs covering 194 Flight Information Regions, enhanced through a multi-agent collaborative framework for comprehensive field discovery. We systematically evaluate a wide range of prompt-engineering strategies and model-adaptation techniques, achieving substantial improvements in aviation text understanding and processing. Our experimental results demonstrate the effectiveness of the proposed approach and offer valuable insights for automated NOTAM analysis systems. Our code is available at: https://github.com/Estrellajer/Knots.
- Asia > China > Beijing > Beijing (0.05)
- North America > United States > Mississippi > Marion County (0.04)
- North America > United States > Louisiana (0.04)
- (5 more...)
- Transportation > Air (1.00)
- Transportation > Infrastructure & Services (0.93)
- Asia > Singapore (0.04)
- Asia > China > Hubei Province > Wuhan (0.04)
- Asia > China > Heilongjiang Province > Harbin (0.04)
- Asia > China > Guangdong Province > Shenzhen (0.04)
- Information Technology > Artificial Intelligence > Natural Language > Information Extraction (1.00)
- Information Technology > Artificial Intelligence > Natural Language > Grammars & Parsing (1.00)
- Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.94)